title: Hitchhiker's guide to data formats
date: 2015-10-21 09:45
author: Christine Lemmer-Webber
tags: data, serialization, foss
slug: hitchhikers-guide-to-data-formats
---
Just thinking out loud this morning on what data formats there are and
how they work with the world:

-   [XML](https://en.wikipedia.org/wiki/XML): The 2000s' hippest
    technology. Combines a clear, parsable tree-based syntax with
    extension mechanisms and a schema system. Still moderately popular,
    though not as popular as it once was. Tons of tooling. Many seem to
    think the tooling makes it overly complex, and JSON has taken over
    much of its place. Has the advantage of unambiguity over vanilla
    JSON, if you know how to use it right, but takes more effort to work
    with.
-   [SGML](https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language):
    XML's soupier grandmother. Influential.
-   [HTML](https://en.wikipedia.org/wiki/HTML): Kind of like
    SGML and XML but for some specific data. Too bad XHTML never
    fulfilled its dream. Without XHTML, it's even soupier than SGML, but
    there's enough tooling for soup-processing that most developers
    don't worry about it.
-   [JSON](https://en.wikipedia.org/wiki/JSON): Also tree-based, but
    keeps things minimal, *just* your basic types. Loved by web
    developers everywhere. Also ambiguous since on its own, it's
    schema-free... this may lead to conflicts between applications. But
    if you know the source and the destination perfectly it's fine. Has
    the advantage of transforming into basic types in pretty much every
    language and widespread tooling. (Don't be evil about being evil,
    though? #vaguejokes) If you want to send JSON between a lot of
    locations and want to be unambiguous in your meaning, or if you want
    more than just the basic types provided, you're going to need
    something more... we'll come to that in a bit.
-   [S-expressions](https://en.wikipedia.org/wiki/S-expression): the
    language of lisp, and lispers claim you can represent anything as
    s-expressions, which is true, but also that's kind of ambiguous on
    its own. Capable also of representing code just as well, which is
    why lispers claim benefits of symmetry and "code that can write
    code". However, serializing "pure data" is also perfectly possible
    with s-expressions. So many variations between languages though...
    it's more of a "generalized family" or even better, a pattern, of
    data (and code) formats. Some [damn
    cool](http://www.more-magic.net/posts/lispy-dsl-sxml.html)
    representations of some of these other formats via sexps. Some
    people get scared away by all the parens, though, which is too bad,
    because (though this strays into code + data, not just data)
    [homoiconicity](https://en.wikipedia.org/wiki/Homoiconicity) can't
    be beat. (Maybe
    [Wisp](http://dustycloud.org/blog/wisp-lisp-alternative/) can help
    there?)
-   [Canonical
    s-expressions](https://en.wikipedia.org/wiki/Canonical_S-expressions):
    S-expressions, with a canonical representation... cool! Most
    developers don't know about it, but it was designed for public key
    cryptography usage, and is still actively used there (libgcrypt uses
    canonical s-expressions under the hood, for instance). No schema
    system, and actually pretty much just lists and binary strings, but
    the binary strings can be marked with "display hints" so systems can
    know how to unpack the data into appropriate types.
-   [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework)
    and friends: The "unicode" of graph-oriented data. Not a
    serialization itself, but a specification on the conceptual modeling
    of data, and you'll hear "linked data" people talking about it a
    lot. A graph of "subject, predicate, object" triples. Pretty cool
    once you learn what it is, though the introductory material is
    really overwhelming. (Also, good luck representing [ordered
    lists](http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html)).
    However, there *is no one* serialization of RDF, which leads to much
    confusion among many developers (including myself, for a long time,
    even while people kept explaining otherwise). For example,
    [rdf/xml](http://www.w3.org/TR/rdf-syntax-grammar/) looks like XML,
    but woe be upon ye who uses XML tooling upon it. So, deserialize to
    RDF, then deal with RDF in RDF land, then serialize again... that's
    the way to go with RDF. There are saner formats than just rdf/xml;
    for example, [Turtle](https://en.wikipedia.org/wiki/Turtle_%28syntax%29)
    is easy to read. The RDF community seems to get mad when you want to
    interpret data as anything other than RDF, which can be very
    off-putting, though the goal of a "platonic form" of data is highly
    admirable. That said, graph based tooling is definitely harder for
    most developers to work with than tree-based tooling, but hopefully
    "the jQuery of RDF" library will become available some day, and
    things will be easier. Interesting stuff to learn, anyway!
-   [json-ld](http://json-ld.org/): A "linked data format", technically
    can transform itself into RDF, but unlike other forms of RDF syntax,
    can often be parsed just on its own as simple JSON. So, say you want
    to have JSON and keep things easy for most of your users who just
    use their favorite interpreted language to extract key value pairs
    from your API. Okay, no problem for them! But suddenly you're also
    consuming JSON from multiple origins, and one of them uses "run" to
    say "run a mile" whereas your system uses "run" to mean "run a
    program". How do you tell these apart? With json-ld you can "expand"
    a JSON representation with supplied context to an unambiguous form,
    and you can "compact" it down again to the terms you know and
    understand in your system, leaving out those you don't. No more
    executing a program for a mile!
-   [Microformats](http://microformats.org/) and
    [RDFa](http://rdfa.info/): Two communities which have been notoriously
    and exasperatingly at odds with each other for over a decade, so why do
    I link them together? Well, both of these take the same approach of
    embedding data in HTML. Great when you have HTML for your data to go
    with, though not all data needs an HTML wrapper. But it's good to be
    able to extract it! RDFa simply extracts to RDF, which we've
    discussed plenty; Microformats extracts to its own thing. A frequent
    point of contention between these groups is vocabulary, and how
    to represent vocabulary. RDFa people like their vocabulary to have
    canonical URIs for each term (well, that's an RDF thing, so not
    surprising), while Microformats people like to document everything
    in a wiki. Arguments about extensibility come up frequently... if you
    want to get into that, see [Amy Guy's summary of
    things](http://rhiaro.co.uk/2015/08/extensibility).
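
To make the canonical s-expressions entry above a bit more concrete, here's a minimal sketch in Python of the "just lists and binary strings" idea: every atom is written as its byte length, a colon, and then the raw bytes, and lists are wrapped in parens with nothing in between. The `csexp` function name and the key-ish sample structure are my own inventions for illustration (this is not libgcrypt's API), and I'm leaving display hints out entirely to keep it short.

```python
def csexp(value):
    """Encode nested lists of bytes/str as a canonical s-expression.

    Illustrative sketch only: atoms become b"<length>:<bytes>", lists
    become b"(" + encoded items + b")". Display hints are not handled.
    """
    if isinstance(value, (list, tuple)):
        return b"(" + b"".join(csexp(item) for item in value) + b")"
    if isinstance(value, str):
        # Assumption: text atoms are stored as UTF-8 bytes.
        value = value.encode("utf-8")
    if not isinstance(value, bytes):
        raise TypeError(f"cannot encode {type(value).__name__} as an atom")
    return str(len(value)).encode("ascii") + b":" + value


# A flat list of atoms.
print(csexp(["this", "Canonical S-expression", "has", "5", "atoms"]))
# prints b'(4:this22:Canonical S-expression3:has1:55:atoms)'

# A made-up nested structure mixing text and raw bytes.
print(csexp(["public-key", ["rsa", ["e", b"\x01\x00\x01"]]]))
```

Since there's no optional whitespace and no alternative quoting, the same structure always comes out as the same bytes, which is the property that makes the format attractive for signing and hashing.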

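And for the "run a mile" versus "run a program" problem from the json-ld entry, here's a toy sketch of the expand/compact idea in plain Python. To be clear about the assumptions: this is *not* the real JSON-LD algorithm (real expansion produces a more elaborate structure, and you'd want an actual json-ld library for it), the `expand` and `compact` helpers are hypothetical, and both vocabulary URLs are made up. It only shows why mapping short keys to full IRIs removes the ambiguity.

```python
def expand(doc, context):
    """Swap each short key for the full IRI the context assigns to it."""
    return {context.get(key, key): value for key, value in doc.items()}


def compact(expanded, context):
    """Swap full IRIs back to short terms, but only those we know about."""
    iri_to_term = {iri: term for term, iri in context.items()}
    return {iri_to_term.get(key, key): value for key, value in expanded.items()}


# Hypothetical contexts: each origin says what *its* "run" means.
FITNESS_CONTEXT = {"run": "http://fitness.example/vocab#run"}
SHELL_CONTEXT = {"run": "http://shell.example/vocab#run"}

from_fitness_app = {"run": "a mile"}
from_our_system = {"run": "backup.sh"}

# After expansion, the two "run" keys are no longer the same key.
expanded_fitness = expand(from_fitness_app, FITNESS_CONTEXT)
expanded_shell = expand(from_our_system, SHELL_CONTEXT)

# Compacting with our own context only shortens terms we understand.
print(compact(expanded_shell, SHELL_CONTEXT))
# prints {'run': 'backup.sh'}
print(compact(expanded_fitness, SHELL_CONTEXT))
# prints {'http://fitness.example/vocab#run': 'a mile'}
```

In this toy version, the fitness app's term stays spelled out as a full IRI after compaction, so our system can skip it instead of trying to execute a mile.
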
Of course, there are more data formats than that. Heck, even on top of
these data formats there's a lot more out there (these days I spend a
lot of time working on [ActivityStreams
2.0](http://www.w3.org/TR/activitystreams-core/) related tooling, which
is just JSON with a specific structure, until you want to get fancier,
add extensions, or jump into linked data land, in which case you can
process it as json-ld). And maybe you'd also find stuff like [Cap'n
Proto](https://capnproto.org/) or [Protocol
Buffers](https://developers.google.com/protocol-buffers/) to be
interesting. But the above are the formats that, today, I think are
generally most interesting or impactful upon my day to day work. I hope
this guide was interesting to you!
